Day 5: Deep Dive into Attention Mechanisms
The heart of the Transformer is attention. Today we'll work through it end to end, from the meaning of Query/Key/Value through Multi-Head Attention and positional encoding, implementing each piece from scratch with numpy.
Intuitive Understanding of Query, Key, Value
Using a library analogy:
- Query (Q): What I’m looking for (the search query)
- Key (K): The title/tags of each book (the index)
- Value (V): The actual content of the book (the content)
We compute the similarity between Q and each K, then retrieve a weighted sum of the Values in proportion to those similarities.
Scaled Dot-Product Attention Implementation
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (seq_len, d_k) - Query
    K: (seq_len, d_k) - Key
    V: (seq_len, d_v) - Value
    mask: used in the decoder to hide future tokens
    """
    d_k = K.shape[-1]
    # Dot product then scale (large d_k leads to large dot products, making softmax extreme)
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Masking: prevent the decoder from seeing future tokens
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    # Numerically stable softmax: a probability distribution per row
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return np.matmul(weights, V), weights

# 4 tokens, 8 dimensions
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

# Causal mask (GPT-style: only attend to previous tokens)
causal_mask = np.tril(np.ones((seq_len, seq_len)))
print(f"Causal mask:\n{causal_mask.astype(int)}")

output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Attention weights:\n{weights.round(3)}")
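Two sanity checks follow directly from the code above: each row of the attention weights should sum to 1, and with the causal mask every weight above the diagonal should be (numerically) zero. A standalone sketch, re-declaring the function so the snippet runs on its own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return np.matmul(weights, V), weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
mask = np.tril(np.ones((seq_len, seq_len)))

_, w = scaled_dot_product_attention(Q, K, V, mask=mask)
print(np.allclose(w.sum(axis=-1), 1.0))   # True: each row is a probability distribution
print(np.allclose(np.triu(w, k=1), 0.0))  # True: no attention to future tokens
```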
Multi-Head Attention Implementation
def multi_head_attention(x, num_heads, d_model):
    """
    Run multiple attention heads in parallel.
    Each head learns different relationship patterns.
    """
    d_k = d_model // num_heads
    outputs = []
    for head in range(num_heads):
        # Separate Q, K, V projection weights per head
        # (random here for illustration; in a real model these are learned)
        W_q = np.random.randn(d_model, d_k) * 0.1
        W_k = np.random.randn(d_model, d_k) * 0.1
        W_v = np.random.randn(d_model, d_k) * 0.1

        Q = np.matmul(x, W_q)
        K = np.matmul(x, W_k)
        V = np.matmul(x, W_v)

        head_output, _ = scaled_dot_product_attention(Q, K, V)
        outputs.append(head_output)

    # Concatenate outputs from all heads: (seq_len, num_heads * d_k) = (seq_len, d_model)
    concatenated = np.concatenate(outputs, axis=-1)

    # Final linear projection
    W_o = np.random.randn(d_model, d_model) * 0.1
    return np.matmul(concatenated, W_o)

d_model = 64
num_heads = 8  # 64 / 8 = 8 dimensions per head
x = np.random.randn(4, d_model)

output = multi_head_attention(x, num_heads, d_model)
print(f"Multi-Head Attention output: {output.shape}")
# e.g. Head 1 might track subject-verb relationships, Head 2 adjective-noun relationships, ...
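The per-head Python loop above is easy to read but slow. Real implementations compute all heads at once with one large projection and a reshape; here is a minimal vectorized sketch (weights are random for the demo, where a trained model would learn them):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 4, 64, 8
d_k = d_model // num_heads
x = rng.standard_normal((seq_len, d_model))

# One big projection per Q/K/V instead of num_heads small ones
W_q = rng.standard_normal((d_model, d_model)) * 0.1
W_k = rng.standard_normal((d_model, d_model)) * 0.1
W_v = rng.standard_normal((d_model, d_model)) * 0.1

def split_heads(t):
    # (seq_len, d_model) -> (num_heads, seq_len, d_k)
    return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

Q = split_heads(x @ W_q)
K = split_heads(x @ W_k)
V = split_heads(x @ W_v)

# Batched attention over all heads at once
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (heads, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V                                       # (heads, seq, d_k)

# Merge heads back into (seq_len, d_model)
merged = out.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (4, 64)
```

The result is equivalent to the loop version (given the same weights), but a single batched matmul lets numpy, or a GPU library, do all heads in parallel.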
Positional Encoding: Sinusoidal vs RoPE
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Original Transformer positional encoding"""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions: sin
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions: cos
    return pe

def apply_rope(x, position):
    """RoPE (Rotary Position Embedding) - used in Llama, GPT-NeoX, etc."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = position * freqs
    # Rotate each even/odd pair of dimensions by a position-dependent angle
    cos_vals = np.cos(angles)
    sin_vals = np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated_even = x_even * cos_vals - x_odd * sin_vals
    rotated_odd = x_even * sin_vals + x_odd * cos_vals
    result = np.zeros_like(x)
    result[..., 0::2] = rotated_even
    result[..., 1::2] = rotated_odd
    return result

# Verify sinusoidal positional encoding
pe = sinusoidal_position_encoding(max_len=10, d_model=16)
print(f"Positional encoding shape: {pe.shape}")
print(f"Distance between position 0 and 1: {np.linalg.norm(pe[0] - pe[1]):.3f}")
print(f"Distance between position 0 and 9: {np.linalg.norm(pe[0] - pe[9]):.3f}")
# Closer positions have more similar vectors
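The property that makes RoPE attractive can be checked numerically: the dot product between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions. A standalone sketch using apply_rope as defined above (repeated so the snippet runs on its own):

```python
import numpy as np

def apply_rope(x, position):
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = position * freqs
    cos_vals, sin_vals = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    result = np.zeros_like(x)
    result[..., 0::2] = x_even * cos_vals - x_odd * sin_vals
    result[..., 1::2] = x_even * sin_vals + x_odd * cos_vals
    return result

rng = np.random.default_rng(2)
q = rng.standard_normal(16)
k = rng.standard_normal(16)

# Same offset (m - n = 2) at different absolute positions
d1 = apply_rope(q, 3) @ apply_rope(k, 1)
d2 = apply_rope(q, 10) @ apply_rope(k, 8)
print(d1, d2)  # the two values are (numerically) identical
```

This is why RoPE encodes *relative* position inside the attention scores themselves, whereas sinusoidal encodings are simply added to the token embeddings before attention.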
Attention learns “which token should attend to which other token.” Multi-Head performs this simultaneously from multiple perspectives, creating richer representations.
Today’s Exercises
- Explain mathematically why we scale by np.sqrt(d_k). Experiment with d_k=64 and compare the softmax output with and without scaling.
- Change the number of heads in Multi-Head Attention to 1, 4, 8, and 16, and observe how d_k changes. What problems arise when there are too many heads?
- Summarize the key differences between sinusoidal positional encoding and RoPE, and research why modern models prefer RoPE.
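As a starting point for the first exercise, here is a small sketch comparing the softmax of raw versus scaled dot products at d_k=64. Unscaled scores have standard deviation around sqrt(d_k), so the softmax concentrates almost all mass on a single key:

```python
import numpy as np

rng = np.random.default_rng(3)
d_k = 64
q = rng.standard_normal(d_k)
K = rng.standard_normal((8, d_k))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

raw = softmax(q @ K.T)                    # no scaling: nearly one-hot
scaled = softmax(q @ K.T / np.sqrt(d_k))  # scaled: softer distribution
print(f"max weight without scaling: {raw.max():.3f}")
print(f"max weight with scaling:    {scaled.max():.3f}")
```

Multiplying the logits by a constant greater than 1 always sharpens the softmax toward its argmax, so the unscaled maximum weight is strictly larger; the scaling keeps gradients flowing to the non-maximal keys during training.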